Supplementary Material: Learning Representations from Audio-Visual Spatial Alignment

Neural Information Processing Systems

These are transformer networks of base dimension 512 and expansion ratio 4. In other words, the output dimensionality of the linear transformations of parameters W_key, W_qry, W_val, W_0 and W_2 is 512, and that of W_1 is 2048. Models are pre-trained to optimize loss (7) for the AVC task or (9) for the AVTS and AVSA tasks. As originally proposed, lateral connections are implemented with a 1x1 convolution that maps all feature maps into a 128-dimensional space, followed by a 3x3 convolution for increased smoothing. Thus, all pixels for which the state-of-the-art model was less than 75% confident were left unlabeled. These low-confidence regions were also ignored while computing evaluation metrics.
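The dimensions above can be made concrete with a minimal sketch of the position-wise feed-forward block: with base dimension 512 and expansion ratio 4, W_1 maps 512 to 2048 and W_2 projects back to 512. Variable names here are illustrative, not taken from the paper's code.

```python
import numpy as np

# Sketch of the feed-forward dimensions described above (assumed structure):
# base dimension 512, expansion ratio 4, so W1: 512 -> 2048 and W2: 2048 -> 512.
d_model, expansion = 512, 4
rng = np.random.default_rng(0)

W1 = rng.standard_normal((d_model, d_model * expansion)) * 0.02  # 512 -> 2048
W2 = rng.standard_normal((d_model * expansion, d_model)) * 0.02  # 2048 -> 512

def feed_forward(x):
    """Position-wise feed-forward block: expand, apply nonlinearity, project back."""
    return np.maximum(x @ W1, 0.0) @ W2  # ReLU in the expanded 2048-dim space

tokens = rng.standard_normal((10, d_model))  # 10 tokens of dimension 512
out = feed_forward(tokens)
print(out.shape)  # (10, 512)
```

The attention projections W_key, W_qry, W_val, and W_0 follow the same pattern but keep the output dimensionality at 512.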


Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction

Neural Information Processing Systems

Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables following an informative energy-based prior for salient object detection. Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which sampling from the intractable posterior and prior distributions of the latent variables is performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model's confidence in predicting saliency from the image. Unlike existing generative models, which define the prior distribution of the latent variables as a simple isotropic Gaussian, our model uses an informative energy-based prior that is more expressive in capturing the latent space of the data. We apply the proposed framework to both RGB and RGB-D salient object detection tasks. Extensive experimental results show that our framework achieves not only accurate saliency predictions but also meaningful uncertainty maps that are consistent with human perception.
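The Langevin dynamics sampling mentioned in the abstract can be sketched in a few lines. The quadratic toy energy below is a stand-in for the paper's learned energy-based model; step size, step count, and function names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Langevin dynamics sampling from an energy E(z):
# z_{t+1} = z_t - (step/2) * dE/dz + sqrt(step) * noise.
# The toy energy here is NOT the paper's learned EBM prior.
def langevin_sample(grad_energy, z0, step=0.01, n_steps=100, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    z = z0.copy()
    for _ in range(n_steps):
        z = z - 0.5 * step * grad_energy(z) + np.sqrt(step) * rng.standard_normal(z.shape)
    return z

# Toy energy E(z) = ||z||^2 / 2, whose gradient is z; samples drift toward N(0, I).
z = langevin_sample(lambda z: z, z0=np.full(8, 5.0))
print(z.shape)  # (8,)
```

In the paper's setting, the same update is run with gradients of the learned energy (for the prior) or of the posterior log-density (for inference), rather than this toy quadratic.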


NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM

Neural Information Processing Systems

Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained Transformer networks. However, these models often contain hundreds of millions or even billions of parameters, bringing challenges to online deployment due to latency constraints. Recently, hardware manufacturers have introduced dedicated hardware for NxM sparsity to provide the flexibility of unstructured pruning with the runtime efficiency of structured approaches. NxM sparsity permits arbitrarily selecting M parameters to retain from a contiguous group of N in the dense representation. However, due to the extremely high complexity of pre-trained models, standard sparse fine-tuning techniques often fail to generalize well on downstream tasks, which typically have limited data.
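The NxM pattern described above can be illustrated with a simple magnitude-based pruning sketch (e.g. the 2:4 pattern supported by recent accelerators): within each contiguous group of N weights, keep the M largest-magnitude values and zero the rest. This is a generic baseline for illustration, not the paper's ADMM-based method.

```python
import numpy as np

# Illustrative N:M semi-structured pruning: in each contiguous group of N
# weights, retain the M largest-magnitude entries and zero the others.
def nm_prune(weights, n=4, m=2):
    w = weights.reshape(-1, n)                    # groups of N contiguous weights
    keep = np.argsort(-np.abs(w), axis=1)[:, :m]  # indices of the M largest magnitudes
    mask = np.zeros_like(w)
    np.put_along_axis(mask, keep, 1.0, axis=1)    # 1.0 where a weight is kept
    return (w * mask).reshape(weights.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.1, 0.8])
pruned = nm_prune(w)
print(pruned)  # keeps 0.9, 0.4 in the first group and -0.7, 0.8 in the second
```

Magnitude pruning alone is the weak baseline the abstract alludes to; the paper's contribution is fine-tuning under this constraint (via ADMM) so accuracy survives on low-resource downstream tasks.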



InEKFormer: A Hybrid State Estimator for Humanoid Robots

Hohmeyer, Lasse, Popescu, Mihaela, Bergonzani, Ivan, Mronga, Dennis, Kirchner, Frank

arXiv.org Artificial Intelligence

Humanoid robots have great potential for a wide range of applications, including industrial and domestic use, healthcare, and search and rescue missions. However, bipedal locomotion in different environments is still a challenge when it comes to performing stable and dynamic movements. This is where state estimation plays a crucial role, providing fast and accurate feedback of the robot's floating base state to the motion controller. Although classical state estimation methods such as Kalman filters are widely used in robotics, they require expert knowledge to fine-tune the noise parameters. Due to recent advances in the field of machine learning, deep learning methods are increasingly used for state estimation tasks. In this work, we propose the InEKFormer, a novel hybrid state estimation method that incorporates an invariant extended Kalman filter (InEKF) and a Transformer network. We compare our method with the InEKF and the KalmanNet approaches on datasets obtained from the humanoid robot RH5. The results indicate the potential of Transformers in humanoid state estimation, but also highlight the need for robust autoregressive training in these high-dimensional problems.
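The hand-tuned noise parameters that motivate the abstract's learned estimator can be seen in even the simplest Kalman filter. The scalar example below is a generic textbook update with assumed values for process noise q and measurement noise r; it is not the paper's InEKF or InEKFormer.

```python
# Minimal scalar Kalman filter, showing the noise parameters (process noise q,
# measurement noise r) that classical filters require experts to hand-tune.
# Values of q and r here are arbitrary illustrative assumptions.
def kalman_step(x, p, z, q=0.01, r=0.1):
    # Predict: the state is assumed constant; uncertainty grows by q.
    p = p + q
    # Update: blend prediction and measurement z via the Kalman gain k.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1.0 - k) * p
    return x, p

x, p = 0.0, 1.0                    # initial estimate and uncertainty
for z in [1.2, 0.9, 1.1, 1.0]:     # noisy measurements of a constant state
    x, p = kalman_step(x, p, z)
print(x, p)                        # estimate converges near 1.0, uncertainty shrinks
```

Hybrid approaches like the one above keep such a filter as the backbone while a learned network (here, a Transformer) supplies quantities that would otherwise be hand-tuned.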